Validate checkpoint GPU UUID inputs early #2086
Signed-off-by: Aryan <aryansputta@gmail.com>
|
@leofang You were right on both points here.
On your question specifically: I don't have a strong public-facing use case for accepting CUuuid objects here. The intended caller-facing path is the string UUID returned by Device.uuid, and the existing real checkpoint tests already follow that contract. For example, cuda_core/tests/test_checkpoint.py builds migration mappings from devices[i].uuid in _build_rotation_mapping() and runs the driver-backed checkpoint/restore scenarios without mocks. |
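To make the string-only contract concrete, here is a minimal sketch of what validating UUID inputs early could look like. It is not the code in this PR: `validate_gpu_uuid` and `build_rotation_mapping` are hypothetical names, the accepted UUID layout (canonical 8-4-4-4-12 hex, optionally prefixed with `GPU-`) is an assumption about what Device.uuid returns, and "rotation" is assumed to mean mapping each device to the next one with wrap-around.

```python
import re

# Assumption: UUID strings use the canonical 8-4-4-4-12 hex layout,
# optionally prefixed with "GPU-". The real Device.uuid format may differ.
_UUID_RE = re.compile(
    r"(?:GPU-)?[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def validate_gpu_uuid(value, name="uuid"):
    """Hypothetical helper: reject non-string or malformed UUIDs up front."""
    if not isinstance(value, str):
        # Non-string inputs (e.g. bytes or a CUuuid-like object) fail here,
        # before anything reaches the driver.
        raise TypeError(f"{name} must be a str, got {type(value).__name__}")
    if _UUID_RE.fullmatch(value) is None:
        raise ValueError(f"{name} is not a valid GPU UUID string: {value!r}")
    return value


def build_rotation_mapping(uuids):
    """Hypothetical: map each device UUID to the next one, wrapping around."""
    checked = [validate_gpu_uuid(u, f"uuids[{i}]") for i, u in enumerate(uuids)]
    n = len(checked)
    return {checked[i]: checked[(i + 1) % n] for i in range(n)}
```

With two placeholder UUIDs, `build_rotation_mapping` returns a mapping in which each UUID points at the other, while passing `bytes` raises `TypeError` immediately instead of failing later inside a driver call.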
|
/ok to test 3261c10 |
|
Signed-off-by: Aryan Putta <aryansputta@gmail.com>
|
@leofang I repushed this as 1520ee3 to retrigger CI because I do not have permission to rerun failed jobs on the upstream repo. I inspected the previous run before repushing: the failures looked unrelated to the checkpoint change (aarch64 sccache/GitHub cache 503, WSL NVML process-name UnicodeDecodeError, Windows jit_lto_fractal driver/backend mismatch, and one IPC flake). Could you approve the new SHA with /ok to test when convenient? |
|
@leofang quick update: I moved this branch to current main as 41507c5 after confirming the previous WSL get_process_name failures were covered by #2118. The effective diff against current main is still only cuda_core/cuda/core/checkpoint.py. Could you approve the new SHA with /ok to test when convenient? |
|
@aryanputta please be patient and wait for one of us to check the CI. We can trigger re-runs of only the failing jobs, but if you merge with main we have to start from the beginning and run all jobs, not just the failing ones. |
|
Also, our CI is known to be flaky for a few things, and a rerun usually solves the problem. |
Sorry about that, @leofang! I was told in the past not to force-push/rebase on shared branches to avoid disrupting reviewers, so I thought moving it to main would be the cleaner approach. I will definitely keep a note for myself to just leave it be and wait for a maintainer to trigger the specific job reruns next time. Also, just wanted to let you know that I fixed the other PR we were working on, #2074, for whenever you have some time to check it out. Thank you! |
|
/ok to test 1ff00e5 |
That is correct, for the very same reasons: when the CI pipelines are triggered, the run is based on a particular snapshot (commit). Any action that results in a new commit (e.g., merging with main, rebasing, force-pushing, ...) forfeits the opportunity to re-run only the failing jobs. |